10 May 2022 | 02806 Social Data Analysis & Visualization
By Emma Valen Rian (s217079), Eper Stinner (s217159), Aaron Alberg (s217050)
Access our explanation and data sources here
NOTE: SOME OF THESE VISUALIZATIONS DO NOT RENDER OR CAUSE STRANGE BEHAVIOR OUTSIDE OF GOOGLE CHROME
Melbourne is the second-largest city in Australia, and its metropolitan area is home to about five million people. In this article we take a closer look at what it is like to live in central Melbourne by exploring the relationship between wealth, environmental conditions and access to services. Do richer residents of Melbourne have access to more services and a cleaner living environment? And do Melbourne's residents care about living next to green spaces? Stay tuned to explore these questions in further detail with us.
We will use three public datasets to analyze Melbourne's living environment. The first is a housing dataset, which contains records of houses sold in greater Melbourne over a span of XX years. By looking at house prices in different suburbs of Melbourne we can get an understanding of the wealth of the residents living there. The second dataset describes Melbourne's urban forest, providing each tree's location, age and more. This dataset indicates how much green space is available in the city. The third and last dataset is an overview of different services in the city and their locations. It covers services like transportation, health, schools and recreation.
Let us first explore the data with some maps. Can we see some patterns between the location (and price) of houses and the proximity of green spaces and public services?
The following map shows an overview of houses sold in the city of Melbourne. We only want to investigate houses within the boundaries of the 14 main suburbs in the municipality of Melbourne, and the boundary is shown as a blue line here. Hover over a dot to see how much a house costs!
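Restricting the houses to the municipal boundary comes down to a point-in-polygon test. Here is a minimal sketch of one common way to do it (ray casting), which is not necessarily how our maps were built. The rectangular boundary and the house coordinates are made up for illustration and are not the real outline or real sales.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon is a list of (lon, lat) vertices; the last vertex connects to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude where the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangular boundary, NOT the real municipal outline.
boundary = [(144.90, -37.85), (145.00, -37.85), (145.00, -37.78), (144.90, -37.78)]

# Hypothetical house locations as (lon, lat).
houses = [(144.95, -37.81), (145.10, -37.81)]

# Keep only the houses that fall inside the boundary.
kept = [h for h in houses if point_in_polygon(h[0], h[1], boundary)]
print(kept)
```

Only the first house falls inside the rectangle, so it is the only one kept.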
Let's see how prices compare across these neighborhoods! The industrial areas near the ports don't seem to have had many house sales in the last few years, which makes sense but makes those regions less interesting to explore. Next let's get an understanding of the relative numbers we're working with.
We clearly have more datapoints for certain neighborhoods and house types, but we don't know for sure whether that reflects bias in the data. A quick Google search tells us that Docklands is an industrial area that was abandoned until 20 years ago and now has apartments for rent. A low number of house sales makes sense given this background! We're working with a relatively narrow time range, but let's see if temporal patterns reveal anything interesting. Take a look at this graph of prices over time.
There seems to be a slight increase over time (with a few crazy mansion sales), but inflation is an easy explanation for this (i.e. real house prices in this time frame haven't changed much). What about looking at it on the map? Make sure to press play to see trends over time!
Next let's compare prices across the neighborhoods:
OK, this helps us identify wealthier neighborhoods. Areas near the Central Business District seem to trend higher. What about different types of homes?
Pretty wide spreads for all of them, but single-family homes trend higher. Makes sense so far!
We don't want to be naive and assume all of our data is perfect. Let's take a look at the 'Distance' property provided by our dataset. Experiment with the toggles and see if you can find anything weird.
You probably noticed that entire neighborhoods have the exact same distance value! Neighborhoods can be pretty spread out, so assigning a uniform value with a potential error of ~50% isn't super helpful. Even worse, you may have noticed that houses in central Melbourne fall under both the 0 km and the 2.8 km marker. Clearly this feature is unreliable! We won't be using the distance property.
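If you want to check this without the toggles, one option is to count how many distinct 'Distance' values each neighborhood has. This is a minimal sketch with pandas on a tiny made-up table. The 'Suburb' column name and the sample rows (including the Kensington value) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has many more columns and records.
houses = pd.DataFrame({
    "Suburb": ["Melbourne", "Melbourne", "Melbourne", "Kensington", "Kensington"],
    "Distance": [0.0, 2.8, 2.8, 4.2, 4.2],
})

# Number of distinct Distance values reported within each suburb.
distinct_per_suburb = houses.groupby("Suburb")["Distance"].nunique()

# Suburbs where every house shares one single distance value.
uniform = distinct_per_suburb[distinct_per_suburb == 1].index.tolist()
# Suburbs that carry several conflicting distance values.
conflicting = distinct_per_suburb[distinct_per_suburb > 1].index.tolist()

print("Uniform distance:", uniform)
print("Conflicting distances:", conflicting)
```

Both outcomes are a warning sign: a single value per suburb means the feature is really a neighborhood label, and several values in one suburb means it is inconsistent.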
Let's see where homes are in relation to the green spaces. Hover over the map to see the different neighborhoods.
The (very aptly named) Parkville clearly has a high density of trees and is lined with homes as well. Otherwise, a high density of trees doesn't seem to go along with a high number of home sales (nor is the inverse shown). This may indicate that trees aren't a significant factor when it comes to home sales. Let's add the price dimension back in.
Parkville definitely shows the positive correlation, but the other neighborhoods still don't show a connection between price and tree density. Maybe access to waterways, business districts, or highways is more important?
This plot looks a lot more like the prices plot. Promising! Let's look at where the landmarks are in relation to the houses. Houses are in blue, landmarks in red.
A little overwhelming to look at, but landmarks/places of interest are definitely located near the houses. What kinds of landmarks does this dataset have?
Lots of recreation, places of worship, and transportation. Again makes sense given our spatial distribution.
Now what if we want to predict future house prices based on our historical data? Let's take a few more looks at some features to see how we can measure them. First up are our trees and services. How do we measure these spatial datapoints against each other? In this case, we've decided for you: a house is 'near' a tree or landmark if it's within 500 m (a reasonable distance for someone to walk). Here's an example of all the trees within half a kilometer of a house in Kensington.
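To make the 500 m rule concrete, here is a minimal sketch of how such a count could be computed using the haversine (great-circle) distance. The function names and coordinates are hypothetical, and the actual analysis may have measured distance differently.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters
NEAR_M = 500                # the walking-distance threshold used in this article

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def count_nearby(house, points, radius_m=NEAR_M):
    """Count the points that lie within radius_m of the house."""
    return sum(
        haversine_m(house[0], house[1], lat, lon) <= radius_m
        for lat, lon in points
    )

# Hypothetical coordinates, roughly in the Kensington area.
house = (-37.7940, 144.9300)
trees = [
    (-37.7940, 144.9300),  # same spot as the house
    (-37.7950, 144.9310),  # about 140 m away
    (-37.8040, 144.9300),  # about 1.1 km south
]

print(count_nearby(house, trees))
```

The same function works for landmarks: pass the landmark coordinates instead of the tree coordinates. For a full dataset a spatial index would be faster than this pairwise loop, but the definition of 'near' stays the same.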
The number of rooms in a house probably varies a lot by home type, price, and other factors. Let's check the room counts by neighborhood to see the differences before we use them in our machine learning.
Before we get into any machine learning, let's take a look at the data we have and check for missing values and data types.
Not all of this is useful, so we're going to throw some of it away. Fields like 'Address' don't provide much because every value is unique. We already looked for temporal patterns, so we won't use 'Date'. 'BuildingArea' is missing for many datapoints, so we'll toss that as well. We'll then drop rows with missing values in important columns like 'Car', 'Landsize', and 'Bedroom2'. Finally, we'll one-hot encode the categorical data into binary fields. Take a look below at the stats on the updated dataset and the correlations between all of our variables.
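The cleaning steps described above could look roughly like this in pandas. This is a sketch on a tiny made-up table: the 'Type' and 'Price' column names and every value in the sample are assumptions, while the dropped and required columns are the ones named in the text.

```python
import pandas as pd

# Hypothetical sample; 'Type' and 'Price' are assumed column names, values are made up.
raw = pd.DataFrame({
    "Address": ["1 A St", "2 B St", "3 C St", "4 D St"],
    "Date": ["3/09/2016", "4/02/2017", "4/03/2017", "7/05/2016"],
    "Type": ["h", "u", "h", "t"],
    "Bedroom2": [2, 1, None, 3],
    "Car": [1, None, 2, 1],
    "Landsize": [202, 0, 134, 120],
    "BuildingArea": [None, 60, None, None],
    "Price": [1_480_000, 520_000, 1_035_000, 850_000],
})

# Drop columns that are unique per row, already explored, or mostly missing.
df = raw.drop(columns=["Address", "Date", "BuildingArea"])

# Drop rows missing any of the columns we consider important.
df = df.dropna(subset=["Car", "Landsize", "Bedroom2"])

# One-hot encode the categorical column into binary indicator fields.
df = pd.get_dummies(df, columns=["Type"])

print(df)
```

In this sample two rows survive, and 'Type' is replaced by one indicator column per remaining category.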
Alright, let's take a look. The RMSE of our model is roughly 400,000 AUD, which really isn't a great look. That means that either a) our historical data is not complete enough or b) the variables we have access to do not correlate particularly strongly with price. Let's take a look at the correlation matrix to see what's up. Some correlations are very intuitive, like price being correlated with neighborhood or type of house. Others are intuitive but not especially useful, like having a second bedroom correlating with having more rooms, or the fact that being an apartment is negatively correlated with being a house (duh!). There are some interesting patterns to see here though: services does correlate with postcode, and garage access correlates with rooms and price.
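For readers unfamiliar with RMSE (root mean squared error): it is the square root of the average squared gap between predicted and actual prices, so it is expressed in the same unit as the price. A minimal sketch with made-up numbers, not our model's actual predictions:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices in AUD, purely for illustration.
actual = [900_000, 1_200_000, 650_000, 2_000_000]
predicted = [1_100_000, 1_000_000, 850_000, 1_800_000]

print(rmse(actual, predicted))
```

Every prediction in this toy example is off by exactly 200,000 AUD, so the RMSE is 200,000 AUD. Because errors are squared before averaging, a few badly mispredicted mansions can inflate the RMSE considerably.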
The regression tree above shows how our model makes its decisions when predicting house price. The darker the leaf node, the more expensive the predicted house. While we showed its inaccuracies, our model does reveal trends, especially at the extremes. For a very low number of nearby trees, the predicted price is almost always lower. Houses in less dense neighborhoods are cheaper, but only if they also have less access to services/landmarks. The model also clearly shows the effect of neighborhood on price.
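If you want to poke at a regression tree yourself, here is a small sketch using scikit-learn. The two features, the six training rows, and the tree depth are all made up for illustration and are not our actual model or data.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical features: [nearby_trees, nearby_services]; prices in AUD are made up.
X = [[5, 2], [10, 3], [200, 40], [250, 45], [220, 5], [240, 8]]
y = [500_000, 520_000, 1_500_000, 1_600_000, 900_000, 950_000]

# A shallow tree keeps the printed rules readable.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned split rules as text.
print(export_text(tree, feature_names=["nearby_trees", "nearby_services"]))

# Predict the price of a house with few trees and few services nearby.
print(tree.predict([[8, 2]])[0])
```

Each leaf predicts the mean price of the training houses that end up in it, which is why a tree captures broad trends at the extremes but can miss by a lot on individual houses.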
Overall, we were a little disappointed that our hypotheses were not confirmed as strongly as we had hoped. Rich people do not have a monopoly on access to Melbourne's resources and green spaces. This is also a good thing! And our exploration was a good lesson in data science: if you always find exactly what you are looking for, you're not really doing anything interesting or new. We hope we took you on an interesting and/or entertaining journey through the data of the wonderful city of Melbourne, Australia. Thanks!